Worksheet - Answer - DGA Detection using Machine Learning

This worksheet is a step-by-step guide on how to detect domains that were generated using a "Domain Generation Algorithm" (DGA). We will walk you through the process of transforming raw domain strings into Machine Learning features and creating a decision tree classifier, which you will use to determine whether a given domain is legit or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.

Overview of the 2 main steps:

  1. Feature Engineering - from raw domain strings to numeric Machine Learning features using DataFrame manipulations
  2. Machine Learning Classification - predict whether a domain is legit or not using a Decision Tree Classifier

DGA - Background

"Various families of malware use domain generation algorithms (DGAs) to generate a large number of pseudo-random domain names to connect to a command and control (C2) server. In order to block DGA C2 traffic, security organizations must first discover the algorithm by reverse engineering malware samples, then generate a list of domains for a given seed. The domains are then either preregistered, sink-holed or published in a DNS blacklist. This process is not only tedious, but can be readily circumvented by malware authors. An alternative approach to stop malware from using DGAs is to intercept DNS queries on a network and predict whether domains are DGA generated. Much of the previous work in DGA detection is based on finding groupings of like domains and using their statistical properties to determine if they are DGA generated. However, these techniques are run over large time windows and cannot be used for real-time detection and prevention. In addition, many of these techniques also use contextual information such as passive DNS and aggregations of all NXDomains throughout a network. Such requirements are not only costly to integrate, they may not be possible due to real-world constraints of many systems (such as endpoint detection). An alternative to these systems is a much harder problem: detect DGA generation on a per domain basis with no information except for the domain name. Previous work to solve this harder problem exhibits poor performance and many of these systems rely heavily on manual creation of features; a time consuming process that can easily be circumvented by malware authors..."
[Citation: Woodbridge et al. 2016: "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks"]

A better alternative for real-world deployment would be to use "featureless deep learning" - We have a separate notebook where you can see how this can be implemented!

However, let's learn the basics first!!!


In [1]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
from yellowbrick.features import Rank2D
from yellowbrick.features import RadViz
from yellowbrick.features import ParallelCoordinates
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

Worksheet for Part 1 - Feature Engineering


In [2]:
## Load data
df = pd.read_csv('../../data/dga_data_small.csv')
df.drop(['host', 'subclass'], axis=1, inplace=True)
print(df.shape)
df.sample(n=5).head() # print a random sample of the DataFrame


(2000, 2)
Out[2]:
isDGA domain
692 dga e74gy61ii122415gqqy214mhd9e
1616 legit mailonline
987 dga krrtolrqcpkrwvb
1664 legit grumpycats
1210 legit algerieinfo

In [3]:
df[df.isDGA == 'legit'].head()


Out[3]:
isDGA domain
1000 legit empressr
1001 legit noticiasvenezuela
1002 legit iptorrents
1003 legit parspal
1004 legit onlineschooladmissions

In [4]:
# Google's 10000 most common English words will be needed to derive the ngram feature...
# therefore we already load them here.
top_en_words = pd.read_csv('../../data/google-10000-english.txt', header=None, names=['words'])
top_en_words.sample(n=5).head()
# Source: https://github.com/first20hours/google-10000-english


Out[4]:
words
3151 perl
3756 bright
7536 decreased
4034 massage
806 either

Part 1 - Feature Engineering

Option 1 for deriving Machine Learning features is to manually hand-craft features that capture useful contextual information about the domain string. An alternative approach (not covered in this notebook) is "Featureless Deep Learning", where an embedding layer takes care of deriving features - a huge step towards more "AI".

Previous academic research has focused on the following features that are based on contextual information:

List of features:

  1. Length ["length"]
  2. Number of digits ["digits"]
  3. Entropy ["entropy"] - use H_entropy function provided
  4. Vowel to consonant ratio ["vowel-cons"] - use vowel_consonant_ratio function provided
  5. N-grams ["n-grams"] - use ngram functions provided

Tasks:
Split into A and B parts, see below...

Please run the following function cell and then continue reading the next markdown cell with more details on how to derive those features. Have fun!


In [5]:
def H_entropy(x):
    # Calculate the Shannon entropy of the string x
    prob = [ float(x.count(c)) / len(x) for c in dict.fromkeys(list(x)) ]
    H = - sum([ p * np.log2(p) for p in prob ])
    return H

def vowel_consonant_ratio(x):
    # Calculate vowel to consonant ratio
    x = x.lower()
    vowels_pattern = re.compile('([aeiou])')
    consonants_pattern = re.compile('([b-df-hj-np-tv-z])')
    vowels = re.findall(vowels_pattern, x)
    consonants = re.findall(consonants_pattern, x)
    try:
        ratio = len(vowels) / len(consonants)
    except ZeroDivisionError: # no consonants in the string
        ratio = 0
    return ratio
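
As a quick sanity check of the two helpers, you can call them on a familiar string; the expected values below are worked out by hand from the definitions above:

# 'facebook' has 8 characters: six appear once, 'o' appears twice,
# so H = -(6*(1/8)*log2(1/8) + (2/8)*log2(2/8)) = 2.75
print(H_entropy('facebook'))              # 2.75
# 4 vowels (a, e, o, o) and 4 consonants (f, c, b, k) -> ratio 1.0
print(vowel_consonant_ratio('facebook'))  # 1.0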

Tasks - A - Feature Engineering

Please try to derive a new pandas 2D DataFrame with a new column for each feature. Focus on length, digits, entropy and vowel-cons here. Also make sure to encode the isDGA column as integers. pandas.Series.str, pandas.Series.replace and pandas.Series.apply can be very helpful to quickly derive those features. The functions you need to apply here are provided in the cell above.

The ngram is a bit more complicated, see next instruction cell to add this feature...


In [ ]:
# YOUR CODE HERE
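
One possible solution sketch for Tasks A (column names are chosen to match the breakpoint feature set loaded later in this worksheet):

# Encode the label as an integer: dga -> 1, legit -> 0
df['isDGA'] = df['isDGA'].replace({'dga': 1, 'legit': 0})

# Length of the domain string
df['length'] = df['domain'].str.len()

# Number of digits in the domain string
df['digits'] = df['domain'].str.count('[0-9]')

# Shannon entropy and vowel/consonant ratio via the helper functions above
df['entropy'] = df['domain'].apply(H_entropy)
df['vowel-cons'] = df['domain'].apply(vowel_consonant_ratio)

df.head()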

Tasks - B - Feature Engineering

Finally, let's tackle the ngram feature. There are multiple steps involved in deriving this feature. In this notebook, we use an implementation outlined in the academic paper Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence" - see the section on Linguistic Features.

  • What are ngrams??? Imagine a string like 'facebook'. If we derive all n-grams for n=2 (aka bi-grams) we get ['fa', 'ac', 'ce', 'eb', 'bo', 'oo', 'ok'], i.e. we slide one step at a time from the left and group 2 characters together each time; a tri-gram of 'facebook' would yield ['fac', 'ace', 'ceb', 'ebo', 'boo', 'ook']. Ngrams have a long history in natural language processing, but are also used a lot, for example, in detecting malicious executables (raw byte ngrams in this case).

Steps involved:

  1. We have the 10000 most common English words (see the data file we loaded; we call this DataFrame top_en_words in this notebook). Now we run the ngrams function on a list of all these words. The output is a list that contains ALL 1-grams, bi-grams and tri-grams of these 10000 most common English words.
  2. We use the Counter function from collections to derive a dictionary d that contains the counts of all unique 1-grams, bi-grams and tri-grams.
  3. Our ngram_feature function does the core magic. It takes your domain as input, splits it into ngrams (n is a function parameter) and then looks up these ngrams in the English dictionary d we derived in step 2. The function returns the normalized sum of all ngrams that were contained in the English dictionary. For example, running ngram_feature('facebook', d, 2) will return 171.28 (this value is just like the one published in the Schiavoni paper).
  4. Finally, average_ngram_feature wraps around ngram_feature. You will use this function, as your task is to derive a feature that gives the average of ngram_feature for n=1, 2 and 3. The input to this function should be a simple list with entries calling ngram_feature with n=1, 2 and 3, hence a list of 3 ngram_feature results.
  5. YOUR TURN: Apply average_ngram_feature to your domain column in the DataFrame, thereby adding ngrams to the df.
  6. YOUR TURN: Finally, drop the domain column from your DataFrame.

Please run the following function cell and then write your code in the following cell.


In [ ]:
# ngrams: Implementation according to Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence"
# http://s2lab.isg.rhul.ac.uk/papers/files/dimva2014.pdf

def ngrams(word, n):
    # Extract all ngrams and return a regular Python list
    # Input word: can be a simple string or a list of strings
    # Input n: can be one integer or a list of integers
    # if you want to extract multiple ngrams and have them all in one list

    l_ngrams = []
    if isinstance(word, list):
        for w in word:
            if isinstance(n, list):
                for curr_n in n:
                    l_ngrams.extend([w[i:i+curr_n] for i in range(0, len(w)-curr_n+1)])
            else:
                l_ngrams.extend([w[i:i+n] for i in range(0, len(w)-n+1)])
    else:
        if isinstance(n, list):
            for curr_n in n:
                l_ngrams.extend([word[i:i+curr_n] for i in range(0, len(word)-curr_n+1)])
        else:
            l_ngrams.extend([word[i:i+n] for i in range(0, len(word)-n+1)])
    return l_ngrams

def ngram_feature(domain, d, n):
    # Input is your domain string or list of domain strings,
    # a dictionary object d that contains the counts for the most common English words,
    # and finally n, either as an int or a list of ints, defining the ngram length

    # Core magic: looks up the domain's ngrams in the English dictionary ngrams and sums up the
    # respective English dictionary counts for the respective domain ngram;
    # the sum is normalized by the number of ngrams in the domain

    l_ngrams = ngrams(domain, n)
    count_sum = 0
    for ngram in l_ngrams:
        count_sum += d[ngram]  # d is a Counter, so missing ngrams simply count as 0
    try:
        feature = count_sum / (len(domain) - n + 1)
    except (ZeroDivisionError, TypeError):  # domain shorter than n, or n passed as a list
        feature = 0
    return feature

def average_ngram_feature(l_ngram_feature):
    # input is a list of calls to ngram_feature(domain, d, n)
    # usually you would use various n values, like 1, 2, 3...
    return sum(l_ngram_feature) / len(l_ngram_feature)


l_en_ngrams = ngrams(list(top_en_words['words']), [1,2,3])
d = Counter(l_en_ngrams)

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE
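
A possible solution sketch for steps 5 and 6, assuming the feature columns from Tasks A are already present in df:

# Step 5: average of ngram_feature for n = 1, 2 and 3, added as the 'ngrams' column
df['ngrams'] = df['domain'].apply(
    lambda dom: average_ngram_feature([ngram_feature(dom, d, 1),
                                       ngram_feature(dom, d, 2),
                                       ngram_feature(dom, d, 3)]))

# Step 6: the raw domain string is no longer needed as a feature
df.drop(['domain'], axis=1, inplace=True)
df.head()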

Breakpoint: Load Features and Labels

If you got stuck in Part 1, don't worry: we provide a final feature set for tomorrow's exercises. Your own feature set should match the data below.


In [6]:
df_final = pd.read_csv('../../data/dga_features_final_df.csv')
print(df_final.isDGA.value_counts())
df_final.sample(n=5).head()


1    1000
0    1000
Name: isDGA, dtype: int64
Out[6]:
isDGA length digits entropy vowel-cons ngrams
360 1 20 0 3.546439 0.333333 746.206433
481 1 14 0 3.521641 0.166667 772.052808
1670 0 10 3 3.321928 0.750000 852.833333
768 1 13 0 3.180833 0.083333 944.982906
409 1 14 0 3.664498 0.272727 970.916056

Visualizing the Results

At this point, we've created a dataset with several features that can be used for classification. Using Yellowbrick, your final step is to visualize the features to see which will be of value and which will not.

First, let's create a Rank2D visualizer to compute the correlations between all the features. Detailed documentation is available here: http://www.scikit-yb.org/en/latest/examples/methods.html#feature-analysis


In [7]:
feature_names = ['length','digits','entropy','vowel-cons','ngrams']
features = df_final[feature_names]
target = df_final.isDGA

In [ ]:
#Your Code here...
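
A minimal sketch of one way to do this with the Rank2D visualizer imported above (older Yellowbrick versions use poof() where newer ones use show()):

# Pearson correlation ranking of all feature pairs
visualizer = Rank2D(features=feature_names, algorithm='pearson')
visualizer.fit(features.values, target.values)   # fit on the feature matrix
visualizer.transform(features.values)            # compute the pairwise correlations
visualizer.poof()                                # draw the heatmap (use show() on newer versions)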

Now let's use a Seaborn pairplot as well. This will really show you which features have clear dividing lines between the classes. Docs are available here: http://seaborn.pydata.org/generated/seaborn.pairplot.html


In [ ]:
# Your code here...
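
One way to do this with seaborn (a sketch; setting hue to the label column colors the two classes):

# Pairwise scatter plots of all features, colored by the isDGA label
sns.pairplot(df_final, hue='isDGA', vars=feature_names)
plt.show()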

Finally, let's try making a RadViz of the features. This visualization will help us see whether there is too much noise to make accurate classifications.


In [ ]:
X = df_final[feature_names].values  # .as_matrix() is deprecated in newer pandas versions
y = df_final.isDGA.values

# Your code here...
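
A minimal RadViz sketch, assuming the integer labels map to 0 = legit and 1 = dga as in the breakpoint data above:

# Radial visualization: each point is a domain, pulled towards the features it scores high on
visualizer = RadViz(classes=['legit', 'dga'], features=feature_names)
visualizer.fit(X, y)        # fit the visualizer to the data
visualizer.transform(X)     # normalize the features for plotting
visualizer.poof()           # draw the plot (use show() on newer versions)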